Constrained Coclustering for Textual Documents

Authors

  • Yangqiu Song
  • Shimei Pan
  • Shixia Liu
  • Furu Wei
  • Michelle X. Zhou
  • Weihong Qian
Abstract

In this paper, we present a constrained co-clustering approach for clustering textual documents. Our approach combines the benefits of information-theoretic co-clustering and constrained clustering. We use a two-sided hidden Markov random field (HMRF) to model both the document and word constraints. We also develop an alternating expectation maximization (EM) algorithm to optimize the constrained coclustering model. We have conducted two sets of experiments on a benchmark data set: (1) using human-provided category labels to derive document and word constraints for semi-supervised document clustering, and (2) using automatically extracted named entities to derive document constraints for unsupervised document clustering. Compared to several representative constrained clustering and co-clustering approaches, our approach is shown to be more effective for high-dimensional, sparse text data.
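As a rough illustration of how an information-theoretic co-clustering objective can be combined with pairwise constraints, the sketch below scores a given document/word clustering by the mutual information lost through clustering, plus a flat penalty for each violated must-link pair. This is a simplified stand-in, not the paper's two-sided HMRF model or its alternating EM procedure: the function names, the `must_link` argument, and the constant `penalty` are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """Return I(X;Y) in nats for a joint distribution given as {(x, y): p}."""
    px = defaultdict(float)
    py = defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(
        p * math.log(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


def cocluster_objective(counts, doc_cluster, word_cluster,
                        must_link=(), penalty=1.0):
    """Score a co-clustering (lower is better).

    counts       -- document-by-word count matrix as a list of rows
    doc_cluster  -- cluster index for each document
    word_cluster -- cluster index for each word
    must_link    -- hypothetical document pairs that should share a cluster
    penalty      -- hypothetical flat cost per violated pair (an assumption,
                    not the HMRF penalty used in the paper)
    """
    total = sum(sum(row) for row in counts)
    joint = {}
    clustered = defaultdict(float)
    for d, row in enumerate(counts):
        for w, c in enumerate(row):
            if c:
                p = c / total
                joint[(d, w)] = p
                clustered[(doc_cluster[d], word_cluster[w])] += p
    # Information lost by replacing documents and words with their clusters.
    info_loss = mutual_information(joint) - mutual_information(clustered)
    violations = sum(
        1 for a, b in must_link if doc_cluster[a] != doc_cluster[b]
    )
    return info_loss + penalty * violations


counts = [[2, 0], [2, 0], [0, 2], [0, 2]]
good = cocluster_objective(counts, [0, 0, 1, 1], [0, 1], must_link=[(0, 1)])
bad = cocluster_objective(counts, [0, 1, 0, 1], [0, 1], must_link=[(0, 1)])
```

In this toy matrix, the first clustering groups documents with identical word usage and satisfies the constraint, so it scores lower than the second, which both loses information and splits a must-link pair.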

Similar resources

Co-Clustering the Documents and Words Using-IJCSEC

In this paper, we propose a novel constrained coclustering method to achieve two goals. First, we combine information theoretic coclustering and constrained clustering to improve clustering performance. Second, we adopt both supervised and unsupervised constraints to demonstrate the effectiveness of our algorithm. The unsupervised constraints are automatically derived from existing knowledge so...

Text Categorization Using Word Similarities Based on Higher Order Co-occurrences

In this paper, we propose an extension of the χ-Sim coclustering algorithm to deal with the text categorization task. The idea behind the χ-Sim method [1] is to iteratively learn the similarity matrix between documents using the similarity matrix between words and vice versa. Thus, two documents are said to be similar if they share similar (but not necessarily identical) words and two words are simila...

An Algorithm for Constrained Association Rule Mining in Semi-structured Data

The need for sophisticated analysis of textual documents is becoming more apparent as data is being placed on the Web and digital libraries are surfacing. This paper presents an algorithm for generating constrained association rules from textual documents. The user specifies a set of constraints, concepts and/or structured values. Our algorithm creates matrices and lists based on these prespecii...

Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Text classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents to predefined categories based on the documents’ contents and labeled training samples. Since word detection is a difficult and time-consuming task in the Persian language, a Bayesian text classifier is an appropriate approach to deal with different...

Textual CBR and Information Retrieval - A Comparison

In recent years, quite a number of projects have started to apply case-based reasoning technology to textual documents instead of highly structured cases. For this, the term Textual CBR has been coined. In this paper, we give an overview of the main ideas of Textual CBR and compare it with Information Retrieval techniques. We also present some preliminary results obtained from three projects perfor...

Publication date: 2010